Homework 1 Instructions
DKU Stats 101 Spring 2024 Session 4
Assignment Background
Imagine that you are an analyst at a consulting firm hired by a boat dealer who wants to make money by buying undervalued boats and reselling them on the global market. Your boss asks you to write a report that identifies the key features of boats that best determine their price. She also sends you this dataset of recent boat sales from the website boattrader.com. Use this dataset to answer the questions below. [1]
Assignment Instructions
- Please leave the name as Anonymous; the homeworks are blind graded so please do not include your name.
- Save this document as a new document (Save As…) and rename it `Homework 1 Answers.qmd`.
- You can import the dataset by inserting the code `boats <- read.csv("boats.csv")` into your setup block. Make sure the data file is in the same folder as your homework file.
- If I say “Interpret…” or “Explain…” that means I want at least 1-2 good quality sentences that show that you really understand the output of what R has produced. Short, incomplete sentences that fail to demonstrate you understand your output will have points deducted.
- While the homework isn’t a formal document, it should be written as if you are a business analyst presenting this information to a boss - i.e. everything should look neat and tidy, have good labels (graphs have appropriate titles and axis labels, etc.), and be well organized.
- If you want a chance to earn extra credit, select your best graph or table (only one per student) and post it to the Graph Contest Teams channel. I will select a few finalists and the class will vote on the best data display. Winner receives significant extra credit.
- Delete the Assignment Background and Assignment Instructions sections of this document before submitting.
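As a rough sketch of the import step, the code below reads a CSV and checks what R made of it. So that it runs anywhere, it first writes a tiny made-up file to a temporary folder; the two rows and the `csv_path` name are illustrative assumptions, not part of the real dataset.

```r
# Toy stand-in for boats.csv: these rows are made up so the sketch runs anywhere.
csv_path <- file.path(tempdir(), "boats.csv")
writeLines(c("id,condition,price",
             "1,new,25000",
             "2,used,9000"), csv_path)

# In the homework itself, the call is simply read.csv("boats.csv").
stopifnot(file.exists(csv_path))  # stops early if the file is not where R is looking
boats <- read.csv(csv_path)

str(boats)  # shows each column and the type R assigned to it
```

If `read.csv()` cannot find the file in your own document, the usual cause is that the data file and the `.qmd` file are in different folders.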
Part 1: One variable analysis
Q1: What kind of dataset do we have? (5 points)
According to the definitions in the textbook, describe the Five W’s for this dataset.
Using the definitions in the textbook, describe the variable type for the following variables:
id
type
boatClass
year
condition
length_ft
beam_ft
dryWeight_lb
price
sellerId
Q2: Literature review (5 points)
Find a news article online that discusses the major features people look for when buying a boat, and include the link in this section.
Based on the article and your own personal expectations, what are some ways we might expect the data to be distributed or variables related? Make a list of at least three things we should expect or look for in the data and write a reason why we should expect it (no need to cite academic papers, just write down your reasons). Reasons should be thoughtful and at least two sentences explaining your logic for the expectation.
Q3: Describing the data (10 points)
The first step in analyzing any dataset is doing some exploratory analysis of the variables.
- Make a histogram of `price`.
  - Describe it using the three features of quantitative data.
  - Does the histogram of `price` surprise you? Why or why not?
  - Which is a better measure of center of the histogram, mean or median?
- Make a nice table displaying the 5 number summary. You can make a nice table using either kable or with the visual editing mode of Quarto. Calculate the five number summary by using the `min()`, `quantile()`, `median()`, and `max()` functions to do this. Show your code in the document (`echo: true`).
- Calculate the standard deviation using the `sd()` function. Interpret it - is it large or small? How does it compare to the IQR? What does this tell you about the shape of the distribution?
- Would this histogram benefit from a transformation, in your opinion? Why or why not? If it would, please transform it appropriately, make a new histogram, and describe the transformation.
- Make a boxplot chart comparing the median of `price` according to the variable `condition`. If you previously transformed your data, keep it transformed for this step. Interpret this graph.
  - Hint: R imports datasets by default keeping text as text (`strings` in R world). Remember the DataCamp method to convert the `strings` into `factors`.
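The mechanics of these steps can be sketched on a small made-up data frame. The prices and conditions below are invented for illustration; `log10()` is only one possible re-expression, and `as.factor()` is one base R way to convert text to factors that may not be the method shown on DataCamp.

```r
# Made-up values for illustration only; the homework uses the real boats data.
boats <- data.frame(
  price     = c(5000, 8000, 12000, 15000, 22000, 30000, 45000, 250000),
  condition = c("used", "used", "new", "used", "new", "used", "new", "new")
)

# Five number summary assembled from the functions named in the question.
five_num <- data.frame(
  Statistic = c("Minimum", "Q1", "Median", "Q3", "Maximum"),
  Price = c(min(boats$price),
            quantile(boats$price, 0.25, names = FALSE),
            median(boats$price),
            quantile(boats$price, 0.75, names = FALSE),
            max(boats$price))
)
print(five_num)  # in a Quarto document, knitr::kable(five_num) renders a formatted table

# Two measures of spread to compare.
price_sd  <- sd(boats$price)   # sensitive to very large or very small values
price_iqr <- IQR(boats$price)  # spread of the middle 50% of the data

# One possible re-expression, followed by a labelled histogram.
boats$log_price <- log10(boats$price)
hist(boats$log_price,
     main = "Boat prices (log10 scale)",
     xlab = "log10(price)")

# Convert text to a factor before grouping, then draw side-by-side boxplots.
boats$condition <- as.factor(boats$condition)
boxplot(log_price ~ condition, data = boats,
        main = "Boat price by condition",
        xlab = "Condition", ylab = "log10(price)")
```

Note that `quantile()` uses an interpolation rule by default, so its quartiles can differ slightly from quartiles computed by hand with a textbook method.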
Q4: Comparing categorical variables (5 points)
- One interesting piece of information the dealer would like to know is whether the types of boats being sold differ according to the type of fuel they use. In particular, they wonder if some fuel types are becoming less popular and therefore more difficult to sell in new boats. The best way to examine this relationship is with a contingency table.
- Add margins to your table. Does it change your interpretation?
- Now convert your table into a proportions table. Does this better help explain what the data show?
- What should you recommend to the dealer? What additional questions or areas of research do the results suggest you should look into?
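The three table steps above might look like the following in base R. The data are made up, and pairing `fuelType` with `condition` is an assumption made for illustration; choose the pair of variables that you think answers the dealer's question.

```r
# Made-up rows for illustration only.
boats <- data.frame(
  fuelType  = c("gas", "gas", "diesel", "electric", "gas", "diesel", "gas", "electric"),
  condition = c("new", "used", "used", "new", "new", "used", "used", "new")
)

# Contingency table of counts: rows are fuel types, columns are conditions.
counts <- table(boats$fuelType, boats$condition, dnn = c("fuelType", "condition"))
counts

# The same table with row and column totals added.
with_margins <- addmargins(counts)
with_margins

# Column proportions: the share of each fuel type within each condition.
col_props <- prop.table(counts, margin = 2)
round(col_props, 2)
```

Using `margin = 1` instead gives row proportions, which answer a different question (the share of each condition within a fuel type), so it is worth deciding which conditional distribution the dealer actually cares about.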
Q5: Understanding and comparing distributions (5 points)
- Using the five number summaries, calculate if `length_ft` has any outliers according to the rule described in the textbook for outliers in boxplots. Show your calculations. Do you believe the outliers identified are real outliers? Why or why not? Consider the purpose of your report when preparing your answer.
- Create a boxplot of `length_ft` by `fuelType`. What can you conclude from this display? Would any of these subgroups benefit from having `length_ft` re-expressed? Why or why not?
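A minimal sketch of the outlier calculation, assuming the textbook rule is the common 1.5 × IQR fence rule (check your textbook for the exact wording). The lengths below are made up for illustration.

```r
# Made-up boat lengths in feet, for illustration only.
length_ft <- c(14, 16, 17, 18, 19, 20, 21, 22, 24, 60)

q1  <- quantile(length_ft, 0.25, names = FALSE)
q3  <- quantile(length_ft, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: values beyond these are flagged as outliers under the assumed rule.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- length_ft[length_ft < lower_fence | length_ft > upper_fence]
outliers
```

Because `quantile()` interpolates by default, the fences it produces can differ a little from fences worked out by hand, so it helps to state which quartile values you used when showing your calculations.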
Q6: The Normal distribution (5 points)
For boats under 20 feet, a rough rule of thumb is that the maximum weight capacity in kilograms of a boat is approximately the length in feet times the width (or beam) of the boat times 5. In a 2006 study of Americans, the average weight was approximately 70 kilos with a standard deviation of 11 kilos.
- In formal notation, write the Normal model of human weight.
- Select at least five boats that are under 20 feet from the database. Make a table of the boats.
- The title of the columns of the table should be z scores ranging from -2 to +2
- Calculate the maximum number of passengers each boat can hold if every passenger has the weight corresponding to each of the following z scores: (-2, -1, 0, 1, 2). Show your calculations below your table.
- If you are a boat maker, what cutoff would you set for boat passenger capacity and why?
- Look up three of the boats in your table on the internet and write down what the manufacturer set as the recommended number of people on the boat. What z score cutoff does it seem the boatmaker made in each of the cases?
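The arithmetic behind the passenger table can be sketched as follows. The mean and standard deviation come from the question; the boat dimensions are hypothetical rather than taken from the dataset, and rounding down with `floor()` is an assumption (a partial passenger cannot board).

```r
mean_wt <- 70  # average weight in kg, from the question
sd_wt   <- 11  # standard deviation in kg, from the question

z <- c(-2, -1, 0, 1, 2)

# Convert each z score back to a weight: weight = mean + z * sd.
passenger_wt <- mean_wt + z * sd_wt  # 48 59 70 81 92

# Hypothetical boat under 20 feet (not a row from the dataset).
length_ft <- 18
beam_ft   <- 7

# Rule of thumb from the question: capacity in kg = length * beam * 5.
capacity_kg <- length_ft * beam_ft * 5  # 630

# Whole passengers that fit at each assumed weight.
max_passengers <- floor(capacity_kg / passenger_wt)
setNames(max_passengers, paste0("z = ", z))  # 13 10 9 7 6
```

Repeating the last two steps for each selected boat fills in one row of the table per boat.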
Part 2: Two variable analysis
Q7: Relationship between variables (15 points)
- Make a scatterplot of `price` as a function of `length_ft`. Add a linear smoother to the plot and label any points you consider to be an outlier using `geom_text()` - the label for the outlier should print the observation’s `id`. If necessary, transform any variables.
  - Do you think there is a clear pattern? Describe the association between `price` and `length_ft`.
  - Find out the details of any outliers you have identified. Do you think the outlier(s) should be excluded from the analysis? Why or why not?
- Make a second graph excluding any outliers you have identified
- What do you estimate the correlation to be, without using technology?
- Check the conditions for correlation
- Find and interpret the correlation coefficient for this relationship
- Interpret this graph. What is some useful information we could communicate to the client? What questions for additional investigation does this graph prompt for you?
- Now, make a third graph of the same two variables but color it by `condition`. Add a linear smoother for each condition.
  - How does this graphical display change the interpretation you developed in your answer to part 6? Why do you think the relationship is structured like this? Explain.
Q8: Putting it all together (15 points)
Drawing on the analysis conducted in the previous sections and on at least one additional investigation of your own, write three paragraphs outlining what you think are the main findings of questions 1-7 plus your additional investigation. The additional investigation can be a graph or table; it should analyze a relationship or distribution different from those asked about in the questions above, one that you think is meaningful and important to communicate to the client. What would you recommend to your boss as to what types of boats sell for the most money? What information are we missing in this dataset that we would need to better understand boat price?
Footnotes
Credit to user MEXWELL for making this data publicly available at https://www.kaggle.com/datasets/mexwell/boat-price-prediction